Library Imports

from pyspark.sql import SparkSession
from pyspark.sql import types as T

from pyspark.sql import functions as F

from datetime import datetime
from decimal import Decimal

Template

spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.6 - Filtering Data")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

sc = spark.sparkContext

import os

data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
pets = spark.read.csv(path, header=True)
pets.toPandas()
id breed_id nickname birthday age color
0 1 1 King 2014-11-22 12:30:31 5 brown
1 2 3 Argus 2016-11-22 10:05:10 10 None
2 3 1 Chewie 2016-11-22 10:05:10 15 None

Filtering Data

Again another commonly used function in data analysis, filtering out unwanted rows.

Option 1 - where()

(
    pets
    .where(F.col('breed_id') == 1)
    .filter(F.col('color') == 'brown')
    .toPandas()
)
id breed_id nickname birthday age color
0 1 1 King 2014-11-22 12:30:31 5 brown

What Happened?

Similar to the functions we have seen so far, there are multiple functioned that get alias to different names that perform the same transformation. IMO I perfor where as it's a bit more intuitive and closer to the sql syntax.

Note:

Notice how we don't have to wrap 1 or brown in a F.lit() function as these conditions are columnary expressions.

We will look into how to perform more complex conditions in 2.1.7 that contain more than 1 condition.

Option 2 - isin()

(
    pets
    .where(F.col('nickname').isin('King', 'Argus'))
    .toPandas()
)
id breed_id nickname birthday age color
0 1 1 King 2014-11-22 12:30:31 5 brown
1 2 3 Argus 2016-11-22 10:05:10 10 None

What Happened?

If you want to know if a column can be of many values then you can use the isin() function. This function takes in both a list of values of comma seperated values. This is again very similar to sql syntax.

Summary

  • We learnt of two filter functions in Spark where() and isin().
  • Using isin you can see if a column can contain multiple values.
  • These functions are named similarly to a sql language.

results matching ""

    No results matching ""